perm filename ZZZ[7,ALS] blob sn#032366 filedate 1973-03-29 generic text, type T, neo UTF8
00002						March 29 1973
00004	
00006	
00008	
00010		Some Preliminary Experiments in Speech Recognition
00020			using Signature Table Learning
00030	
00040	       	 		   by
00050			R.B.Thosar and A.L.Samuel
00060	
00070		A limited amount of success has been achieved in the
00080		application of the signature table scheme of machine
00090		learning to the problem of automatic speech recognition.
00100		The scheme is based on the assumption that the recognition
00110		system must eventually employ a learning mechanism and that
00120		the acoustic part of the system must start by dealing
00130		with the recognition of fairly elemental speech segments
00140		rather than with words if it is to have general utility.
00150	
00160	
00170	
00180		This paper describes the general philosophy that is being
00190	followed and some early results that have been obtained in an attempt
00200	to devise elements of a speech recognition system that is not
00210	dependent upon the use of a limited vocabulary and that can recognize
00220	continuous speech by a number of different speakers.
00230	
00240		Such a system should be able to function successfully either
00250	without any previous training for the specific speaker in question or
00260	after a short training session in which the speaker would be asked to
00270	repeat certain phrases designed to train the system on those phonetic
00280	utterances that seemed to depart from the previously learned norm. In
00290	either case it is believed that some automatic or semi-automatic
00300	training system should be employed to acquire the data that is used
00310	for the identification of the phonetic information in the speech. We
00320	believe that this can best be done by employing a modification of the
00330	signature table scheme previously described by one of us.
00340	
00345	The Overall System
00350	
00360		The over-all system is envisioned as one in which the more or
00370	less conventional method is used of separating the input speech into
00380	short time slices for which some sort of frequency analysis,
00390	homomorphic, LPC, or the like, is done. We then interpret this
00400	information in terms of significant features by means of a set of
00410	signature tables. At this point we define longer sections of the
00420	speech called EVENTS which are obtained by grouping together varying
00430	numbers of the original slices on the basis of their similarity. This
00440	then takes the place of other forms of initial segmentation. Having
00450	identified a series of EVENTS in this way we next use another set of
00460	signature tables to extract information from the sequence of events
00470	and combine it with a limited amount of syntactic and semantic
00480	information to define a sequence of phonemes.
00490	
00500	
00510	Advantages of the Signature Table approach
00520	
00530		Signature tables can be used to perform four essential
00540	functions that are required in the automatic recognition of speech.
00550	These functions are: (1) the elimination of superfluous and
00560	redundant information from the acoustic input stream, (2)
00570	the transformation of the remaining information from one coordinate
00580	system to a more phonetically meaningful coordinate system, (3) the
00590	mixing of acoustically derived data with syntactic, semantic and
00600	linguistic information to obtain the desired recognition, and (4) the
00610	introduction of a learning mechanism.
00620	
00630	An early form of Signature Table
00640	
00650		For those not familiar with the use of signature tables as
00660	used by Samuel in programs which played the game of checkers, the
00670	concept is best illustrated (Fig.1) by an arrangement of tables used
00680	in the program. There are 27 input terms. Each term evaluates a
00690	specific aspect of a board situation and it is quantized into a
00700	limited but adequate range of values, 7, 5, and 3, in this case. The
00710	terms are divided into 9 sets with 3 terms each, forming the 9 first
00720	level tables. Outputs from the first level tables are quantized to 5
00730	levels and combined into 3 second level tables and, finally, into one
00740	third-level table whose output represents the figure of merit of the
00750	board in question.

00760		A signature table has an entry for every possible combination
00770	of the input vector. Thus there are 7*5*3 or 105 entries in each of
00780	the first level tables. Training consists of accumulating two counts
00790	for each entry during a training sequence. Count A is incremented
00800	when the current input vector represents a preferred move and count D
00810	is incremented when it is not the preferred move. The output from the
00820	table is computed as a correlation coefficient
00830				C = (A-D)/(A+D)
00840	The figure of merit for a board is simply the coefficient obtained as
00850	the output from the final table.
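
As an illustration only, the two-count training scheme and the coefficient C can be sketched in Python as follows. The class and method names are ours and do not come from the checkers program, and the neutral value reported for an entry that has never been seen is an assumption.

```python
from itertools import product


class SignatureTable:
    """One table with an entry for every combination of quantized inputs."""

    def __init__(self, ranges):
        self.ranges = tuple(ranges)  # number of levels per input, e.g. (7, 5, 3)
        # Each entry holds the two counts [A, D].
        self.counts = {
            key: [0, 0] for key in product(*(range(r) for r in self.ranges))
        }

    def train(self, inputs, preferred):
        """Increment A for a preferred move, D otherwise."""
        entry = self.counts[tuple(inputs)]
        if preferred:
            entry[0] += 1
        else:
            entry[1] += 1

    def output(self, inputs):
        """Return C = (A - D) / (A + D) for the addressed entry."""
        a, d = self.counts[tuple(inputs)]
        if a + d == 0:
            return 0.0  # assumption: an unseen entry reports a neutral value
        return (a - d) / (a + d)


table = SignatureTable((7, 5, 3))
print(len(table.counts))          # 105 entries, as in a first level table
for _ in range(3):
    table.train((2, 1, 0), preferred=True)
table.train((2, 1, 0), preferred=False)
print(table.output((2, 1, 0)))    # (3 - 1) / (3 + 1) = 0.5
```

Because the samples are presented one at a time and only the counts are retained, nothing beyond the table itself has to be stored during training.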
00860	
00870		The following three advantages emerge from this method of
00880	training and evaluation.
00890		1) Essentially arbitrary inter-relationships between the
00900	input terms are taken into account by any one table. The only loss of
00910	accuracy is in the quantization.
00920		2) The training is a very simple process of accumulating
00930	counts. The training samples are introduced sequentially, and hence
00940	simultaneous storage of all the samples is not required.
00950		3) The process linearizes the storage requirements in the
00960	parameter space. In the case shown this requires only 343 entries
00970	instead of the 105↑9 entries that would be needed were the entire space to be represented.
00980	
00990		The chief disadvantage of this simple form of table relates
01000	to the highly questionable practice of using the correlation
01010	coefficient outputs from some tables as inputs to other tables. This
01020	defect has been overcome in a recent form of table described
01030	elsewhere. The simple system still works remarkably well as will be
01040	seen by the results below.
01050	
01055	Signature Tables for Speech Recognition
01060	
01070		The signature tables, as used in speech recognition, must be
01080	particularized to allow for the multi-category nature of the output.
01090	Several forms of tables have been investigated. The initial form
01100	tested and used for the data to be presented in this paper uses
01110	tables consisting of two parts, a preamble and the table proper. The
01120	preamble contains: (1) space for saving a record of the current and
01130	recent output reports from the table, (2) identifying information as
01140	to the specific type of table, (3) a parameter that identifies the
01150	desired output from the table and that is used in the learning
01160	process, (4) a gating parameter specifying the input that is to be
01170	used to gate the table, (6) the gating level to be used and (7)
01180	parameters that identify the sources of the normal inputs to the
01190	table.
01200	
01210		All inputs are limited in range and specify either the
01220	absolute level of some basic property or more usually the probability
01230	of some property being present. These inputs may be from the
01240	original acoustic input or they may be the outputs of other tables.
01250	If from other tables they may be for the current time step or for
01260	earlier time steps (subject to practical limits as to the number of
01270	time steps that are saved).
01280	
01290		The output, or outputs, from each table are similarly limited
01300	in range and specify, in all cases, a probability that some
01310	particular significant feature, phonette, phoneme, word segment, word
01320	or phrase is present.
01330	
01340		We are limiting the range of inputs and outputs to values
01350	specified by 3 bits and the number of entries per table to 64
01360	although this choice of values is a matter to be determined by
01370	experiment. We are also providing for any of the following input
01380	combinations, (1) one input of 6 bits, (2) two inputs of 3 bits each,
01390	(3) three inputs of 2 bits each, and (4) six inputs of 1 bit each.
01400	The uses to which these different forms are put will be described
01410	later.
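
One way to picture the four input combinations is as different ways of filling the same 6-bit address into a 64-entry table. The following Python sketch is illustrative only; the function name and the bit ordering (first input in the most significant bits) are assumptions, not details taken from the program.

```python
def pack_index(inputs, bits_per_input):
    """Pack equal-width quantized inputs into one 6-bit table index."""
    if len(inputs) * bits_per_input != 6:
        raise ValueError("the inputs must fill exactly 6 bits")
    index = 0
    for value in inputs:
        if not 0 <= value < (1 << bits_per_input):
            raise ValueError("input value out of range")
        # Assumed ordering: earlier inputs occupy the higher bits.
        index = (index << bits_per_input) | value
    return index


print(pack_index([63], 6))                # one input of 6 bits      -> 63
print(pack_index([5, 3], 3))              # two inputs of 3 bits     -> 43
print(pack_index([3, 3, 3], 2))           # three inputs of 2 bits   -> 63
print(pack_index([1, 0, 1, 1, 0, 1], 1))  # six inputs of 1 bit each -> 45
```

Whatever the combination, the index always falls in the range 0 to 63, so all four forms can share the same table body.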
01420	
01430		The body of each table contains entries corresponding to
01440	every possible combination of the allowed input parameters. Each
01450	entry in the table actually consists of several parts. There are
01460	fields assigned to accumulate counts of the occurrences of incidents
01470	in which the specifying input values coincided with the different
01480	desired outputs from the table as found during previous learning
01490	sessions and there are fields containing the summarized results of
01500	these learning sessions, which are used as outputs from the table.
01510	The outputs from the tables can then express to the allowed accuracy
01520	all possible functions of the input parameters.
01530	
01532	Operation in the Training Mode
01534	
01540		When operating in the training mode the program is supplied
01550	with a sequence of stored utterances with accompanying phonetic
01560	transcriptions. Each segment of the incoming speech signal is
01570	analysed (Fourier transforms or inverse filter equivalent) to obtain
01580	the necessary input parameters for the lowest level tables in the
01590	signature table hierarchy. At the same time reference is made to a
01600	table of phonetic "hints" which prescribe the desired outputs from
01610	each table which correspond to all possible phonemic inputs. The
01620	signature tables are then processed.
01630	
01640		The processing of each table is done in two steps, one
01650	process at each entry to the table and the second only periodically.
01660	The first process consists of locating a single entry line within the
01670	table as specified by the inputs to the table and adding a 1 to the
01680	appropriate field to indicate the presence of the property specified
01690	by the hint table as corresponding to the phoneme specified in the
01700	phonemic transcription. At this time a report is also made as to the
01710	table's output as determined from the averaged results of previous
01720	learning so that a running record may be kept of the performance of
01730	the system. At periodic intervals all tables are updated to
01740	incorporate recent learning results. To make this process easily
01750	understandable, let us restrict our attention to a table used to
01760	identify a single significant feature, say Voicing. The hint table
01770	will identify whether or not the phoneme currently being processed is
01780	to be considered voiced. If it is voiced, a 1 is added to the "yes"
01790	field of the entry line located by the normal inputs to the table. If
01800	it is not voiced, a 1 is added to the "no" field. At updating time
01810	the output that this entry will subsequently report is determined by
01820	dividing the accumulated sum in the "yes" field by the sum of the
01830	numbers in the "yes" and the "no" fields, and reporting this quantity
01840	as a number in the range from 0 to 7. Actually the process is a bit
01850	more complicated than this and it varies with the exact type of table
01860	under consideration, as reported in detail in appendix B. Outputs
01870	from the signature tables are not probabilities, in the strict sense,
01880	but are the statistically-arrived-at odds based on the actual
01890	learning sequence.
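
The two-step processing of a single-feature table such as Voicing might be sketched as below. This is a simplified illustration: the names are ours, the rounding of the ratio onto the 0 to 7 scale is an assumption, and the additional complications deferred to appendix B are not modelled.

```python
class FeatureTable:
    """A 64-entry table for one yes/no feature, e.g. Voicing."""

    def __init__(self, n_entries=64):
        self.yes = [0] * n_entries
        self.no = [0] * n_entries
        self.report = [0] * n_entries  # summarized outputs in the range 0..7

    def observe(self, index, hint_is_present):
        """Per-segment step: report the current output, then add 1 to a field."""
        current = self.report[index]  # based on previous learning only
        if hint_is_present:
            self.yes[index] += 1
        else:
            self.no[index] += 1
        return current

    def update(self):
        """Periodic step: fold the accumulated counts into the reported outputs."""
        for i, (y, n) in enumerate(zip(self.yes, self.no)):
            total = y + n
            if total:
                # Assumed scaling: 7 * y / total, rounded to the nearest integer.
                self.report[i] = (14 * y + total) // (2 * total)


voicing = FeatureTable()
for hint in (True, True, True, False):
    voicing.observe(10, hint)
print(voicing.report[10])  # still 0: no update has taken place yet
voicing.update()
print(voicing.report[10])  # 3 of 4 samples voiced -> 5 on the 0..7 scale
```

Keeping the reported value separate from the counts means that the running performance record reflects only what had been learned before the current segment was seen.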

01900	
01910		The preamble of the table has space for storing twelve past
01920	outputs. An input to a table can be delayed to that extent. This table
01930	relates outcomes of previous events with the present hint, the
01940	learning input. A certain amount of context dependent learning is thus
01950	possible with the limitation that the specified delays are constant.
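
A fixed-depth record of past outputs of this kind behaves like a small shift register. The sketch below uses a Python deque to stand in for the preamble storage; the class name and the convention that a delay of 0 means the most recent output are assumptions made for the example.

```python
from collections import deque


class OutputHistory:
    """Keeps the most recent outputs of a table, newest first."""

    def __init__(self, depth=12):
        self.past = deque(maxlen=depth)  # older outputs fall off the far end

    def push(self, output):
        self.past.appendleft(output)

    def delayed(self, steps):
        """Return the output reported `steps` time steps before the latest one."""
        if not 0 <= steps < len(self.past):
            raise IndexError("delay exceeds the saved history")
        return self.past[steps]


history = OutputHistory()
for step in range(15):
    history.push(step % 8)   # outputs are limited to the range 0..7
print(history.delayed(0))    # latest output, pushed at step 14 -> 6
print(history.delayed(11))   # oldest saved output, pushed at step 3 -> 3
```

A table that takes such a delayed output as one of its inputs always looks back by the same fixed number of steps, which is the limitation noted above.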

01960	
01970		The interconnected hierarchy of tables forms a network which
01980	runs incrementally, in steps synchronous with the time window over which
01990	the input signal is analysed. The present window width is set at 12.8
02000	ms (256 points at 20 K samples/sec) with an overlap of 6.4 ms. Inputs
02010	to this network are the parameters abstracted from the frequency
02020	analyses of the signal, and the specified hint. The outputs of the
02030	network could be either the probability attached to every phonetic
02040	symbol or the output of a table associated with a feature such as
02050	voiced, vowel, etc. The point to be made is that the output generated
02060	for a segment is essentially independent of its contiguous
02070	segments. The dependency achieved by using delays in the inputs is
02080	invisible to the outputs. The outputs thus report the best estimate of
02090	what the current acoustic input is with no relation to the past
02100	outputs. Relating the successive outputs along the time dimension is
02110	realised by counters.
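
The window arithmetic can be checked with a short Python sketch: 256 points at 20,000 samples per second span 12.8 ms, and an overlap of 6.4 ms corresponds to advancing 128 points per step. The function below is illustrative only; it merely slices the signal and does not perform the frequency analysis.

```python
SAMPLE_RATE = 20_000   # samples per second
WINDOW = 256           # points per window: 256 / 20000 s = 12.8 ms
HOP = 128              # advance per step: 12.8 ms - 6.4 ms overlap = 6.4 ms


def frames(samples):
    """Yield (start time in seconds, window of samples) for each time step."""
    for start in range(0, len(samples) - WINDOW + 1, HOP):
        yield start / SAMPLE_RATE, samples[start:start + WINDOW]


signal = list(range(1000))           # stand-in for digitized speech
steps = list(frames(signal))
print(len(steps))                    # 6 complete windows fit in 1000 points
print(steps[1][0])                   # second window starts at 0.0064 s
```

Each window produced in this way corresponds to one step of the table network.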
02120	
02122	The Use of COUNTERS
02124	
02126		The transition from initial segment space to event space is
02128	made possible by means of COUNTERS which are summed and reinitiated
02129	whenever their inputs cross specified threshold values, being
02131	triggered on when the input exceeds the threshold and off when it
02133	falls below.  Momentary spikes are eliminated by specifying time
02170	hysteresis, the number of consecutive segments for which the input
02180	must be above the threshold. The output of a counter provides
02190	information about starting time, duration and average input for the
02200	period it was active.
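
A counter of this kind might be sketched in Python as follows. The function name and the report format are ours. The sketch also assumes that the counter turns off as soon as a single segment falls to or below the threshold, since the text specifies hysteresis only for turning on.

```python
def run_counter(inputs, threshold, hysteresis):
    """Return (start, duration, average input) for each active period.

    A period is reported only if the input stayed above `threshold` for at
    least `hysteresis` consecutive segments, so momentary spikes are dropped.
    """
    reports = []
    start, total = None, 0
    # A trailing None acts as a sentinel that closes a period still open at the end.
    for t, value in enumerate(list(inputs) + [None]):
        above = value is not None and value > threshold
        if above:
            if start is None:
                start, total = t, 0   # counter is reinitiated
            total += value            # counter is summed
        elif start is not None:
            duration = t - start
            if duration >= hysteresis:
                reports.append((start, duration, total / duration))
            start = None
    return reports


voiced = [0, 6, 0, 5, 6, 7, 1, 0]    # one table output per time step
print(run_counter(voiced, threshold=4, hysteresis=2))
# the spike at step 1 is ignored; steps 3 to 5 give [(3, 3, 6.0)]
```

Times here are counted in segments; multiplying by the step duration would convert them to seconds.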
02210	
02220		Since a counter can reference a table at any level in the
02230	hierarchy of tables, it can reflect any desired degree of information
02240	reduction. For example, a counter may be set up to show a section of
02250	speech to be a vowel, a front vowel or the vowel /I/. The counters can
02260	be looked upon as representing a mapping of parameter-time space into a
02270	feature-time space, or at a higher level symbol-time space. It may be
02280	useful to carry along the feature information as a back up in those
02290	situations where the symbolic information is not acceptable to
02300	syntactic or semantic interpretation.
02310	
02320		In the same manner as the tables, the counters run completely
02330	independently of each other. In a recognition run the counters may
02340	overlap in arbitrary fashion, may leave out gaps where no counter has
02350	been triggered or may not line up nicely. A properly segmented output,
02360	where the consecutive sections are in time sequence and are neatly
02370	labeled, is essential for processing it further. This is achieved by
02380	registering the instants when the counters are triggered or
02390	terminated to form time segments called events.
02400	
02410		An event is the period between successive activation or
02420	termination of any counter. An event shorter than a specified time is
02430	merely ignored. A record of event durations and up to three active
02440	counters, ordered according to their probability, is maintained.
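
The formation of events from overlapping counter reports can be illustrated with the Python sketch below. The record layout, the labels and the use of the counter's average input as its "probability" for ordering are assumptions made for the example.

```python
def build_events(reports, min_duration=1):
    """Cut the time axis at every counter start or stop and list the events.

    `reports` holds (label, start, duration, average) tuples, one per counter
    activation. Each event is (start, duration, up to three active labels).
    """
    boundaries = sorted({t for _, s, d, _ in reports for t in (s, s + d)})
    events = []
    for begin, end in zip(boundaries, boundaries[1:]):
        if end - begin < min_duration:
            continue  # events shorter than the specified time are ignored
        active = [
            (avg, label)
            for label, s, d, avg in reports
            if s <= begin and end <= s + d
        ]
        active.sort(reverse=True)  # most probable counter first
        events.append((begin, end - begin, [label for _, label in active[:3]]))
    return events


reports = [
    ("vowel", 0, 10, 6.0),
    ("front_vowel", 2, 6, 5.0),
    ("nasal", 12, 4, 4.5),
]
for event in build_events(reports, min_duration=3):
    print(event)
# (2, 6, ['vowel', 'front_vowel'])
# (12, 4, ['nasal'])
```

With min_duration set to 1 the same input also yields the short events at the edges of the vowel and the gap between steps 10 and 12, where no counter is active.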
02450	
02460		An event resulting from the processing described so far
02470	represents a phonette, one of the basic speech categories defined
02480	as hints in the learning process. It is only an estimate of closeness
02490	to a speech category, based on past learning. Also each category has
02500	a more-or-less stationary spectral characterisation. Thus a category
02510	may have a phonemic equivalent as in the case of vowels, it may be
02520	common to a phoneme class as for the voiced or unvoiced stop gaps, or it
02530	may be subphonemic as a T-burst or a K-burst. The choices are
02540	based on acoustic expediency, i.e. optimisation of the learning,
02550	rather than any linguistic considerations. However, higher level
02560	interpretive programs may best operate on inputs resembling phonemic
02570	transcription. The contiguous events may be coalesced into phoneme-like
02580	units using dyadic or triadic probabilities and acoustic-phonetic
02590	rules particular to the system. For example, a period of silence
02600	followed by a type of burst or a short friction may be combined to
02610	form the corresponding stop. A short friction or a burst following a
02620	nasal or a lateral may be called a stop even if the silence period is
02630	short or absent. Clearly these rules must be specific to the system,
02640	based on the confidence with which durations and phonette categories
02650	are recognised.
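
The two example rules can be sketched in Python as below. Everything specific in the sketch is hypothetical: the phonette labels, the mapping from bursts to stops and the event format are invented for illustration, and the dyadic and triadic probabilities are left out entirely.

```python
# Hypothetical mapping from burst phonettes to the corresponding stops.
BURST_TO_STOP = {"T-burst": "t", "K-burst": "k"}


def coalesce(events):
    """Merge contiguous (label, duration) events into phoneme-like units."""
    merged = []
    prev_label = None
    i = 0
    while i < len(events):
        label, duration = events[i]
        following = events[i + 1] if i + 1 < len(events) else None
        # Rule 1: silence followed by a burst forms the corresponding stop.
        if label == "silence" and following and following[0] in BURST_TO_STOP:
            merged.append((BURST_TO_STOP[following[0]], duration + following[1]))
            prev_label = following[0]
            i += 2
            continue
        # Rule 2: a burst right after a nasal or lateral is a stop even
        # though the silence period is absent.
        if label in BURST_TO_STOP and prev_label in ("nasal", "lateral"):
            merged.append((BURST_TO_STOP[label], duration))
        else:
            merged.append((label, duration))
        prev_label = label
        i += 1
    return merged


events = [("a", 8), ("silence", 5), ("T-burst", 2), ("nasal", 6), ("K-burst", 2)]
print(coalesce(events))
# [('a', 8), ('t', 7), ('nasal', 6), ('k', 2)]
```

In a real system the decision to merge would also depend on the confidence attached to each duration and category, which this rule-only sketch does not represent.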
02660	
02670		While it would be possible to extend this bottom up approach
02680	still further, it seems reasonable to break off at this point and
02690	revert to a top down approach from here on. The real difference in
02700	the overall system would then be that the top down analysis would
02710	deal with the outputs from the signature table section as its
02720	primitives rather than with the outputs from the initial measurements
02730	either in the time domain or in the frequency domain. In the case
02740	of inconsistencies the system could either refer to the second choices
02750	retained within the signature tables or if need be could always go
02760	clear back to the input parameters. The decision as to how far to
02770	carry the initial bottom up analysis must depend upon the relative
02780	cost of this analysis both in complexity and processing time and
02790	the certainty with which it can be performed as compared with the
02800	costs associated with the rest of the analysis and the certainty
02810	with which it can be performed, taking due notice of the costs in
02820	time of recovering from false starts.
02830